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Abstract —In a multicore system, applications running on different cores 
interfere at main memory. This inter-application interference degrades 
overall system performance and unfairly slows down applications. Prior 
works have developed application-aware memory request schedulers to 
tackle this problem. State-of-the-art application-aware memory request 
schedulers prioritize memory requests of applications that are vulner¬ 
able to interference, by ranking individual applications based on their 
memory access characteristics and enforcing a total rank order. 

In this paper, we observe that state-of-the-art application-aware 
memory schedulers have two major shortcomings. First, such sched¬ 
ulers trade off hardware complexity in order to achieve high performance 
or fairness, since ranking applications individually with a total order 
based on memory access characteristics leads to high hardware cost 
and complexity. Such complexity could prevent the scheduler from 
meeting the stringent timing requirements of state-of-the-art DDR pro¬ 
tocols. Second, ranking can unfairly slow down applications that are 
at the bottom of the ranking stack, thereby sometimes leading to high 
slowdowns and low overall system performance. To overcome these 
shortcomings, we propose the Blacklisting Memory Scheduler (BLISS), 
which achieves high system performance and fairness while incurring 
low hardware cost and complexity. BLISS design is based on two new 
observations. First, we find that, to mitigate interference, it is sufficient to 
separate applications into only two groups, one containing applications 
that are vulnerable to interference and another containing applications 
that cause interference, instead of ranking individual applications with 
a total order. Vulnerable-to-interference group is prioritized over the 
interference-causing group. Second, we show that this grouping can 
be efficiently performed by simply counting the number of consecutive 
requests served from each application - an application that has a large 
number of consecutive requests served is dynamically classified as 
interference-causing. 

We evaluate BLISS across a wide variety of workloads and system 
configurations and compare its performance and hardware complexity 
(via RTL implementations), with five state-of-the-art memory schedulers. 
Our evaluations show that BLISS achieves 5% better system perfor¬ 
mance and 25% better fairness than the best-performing previous mem¬ 
ory scheduler while greatly reducing critical path latency and hardware 
area cost of the memory scheduler (by 79% and 43%, respectively), 
thereby achieving a good trade-off between performance, fairness and 
hardware complexity. 

1 Introduction 

In modern systems, the high latency of accessing large- 
capacity off-chip memory and limited memory bandwidth 
have made main memory a critical performance bottleneck. 
In a multicore system, main memory is typically shared by 
applications running on different cores (or, hardware contexts). 
Requests from such applications contend for the off-chip 
memory bandwidth, resulting in interference. Several prior 
works |l32l |27] [30l [311 demonstrated that this inter-application 
interference can severely degrade overall system performance 
and fairness. This problem will likely get worse as the number 
of cores on a multicore chip increases lIZTl . 

Prior works proposed different solution approaches to miti¬ 
gate inter-application interference, with the goal of improving 
system performance and fairness (e.g., ll^ [3T1 |28l [17] [T8| 
[To] [29] [l4] |7] [41] [161 mi mi). A prevalent solution direction 


is application-aware memory request scheduling (e.g., ll^ 
EHEsi [ni na m)- The basic idea of application-aware 
memory scheduling is to prioritize requests of different ap¬ 
plications differently, based on the applications’ memory ac¬ 
cess characteristics. State-of-the-art application-aware memory 
schedulers typically i) monitor applications’ memory access 
characteristics, ii) rank applications individually based on 
these characteristics such that applications that are vulnerable 
to interference are ranked higher and iii) prioritize requests 
based on the computed ranking. 

We observe that there are two major problems with past 
ranking-based schedulers. First, such schedulers trade off 
hardware complexity in order to improve performance or 
fairness. They incur high hardware complexity (logic and 
storage overhead as well as critical path latency) to schedule 
requests based on a scheme that ranks individual applications 
with a total order. As a result, the critical path latency and 
chip area cost of such schedulers are significantly higher 
compared to application-unaware schedulers. For example, 
as we demonstrate in Section |7.2| based on our RTL de¬ 
signs, TCM m, a state-of-the-art application-aware scheduler 
is 8x slower and 1.8x larger than a commonly-employed 
application-unaware scheduler, FRFCFS llTSl . Second, such 
schedulers not only increase hardware complexity, but also 
cause unfair slowdowns. When a total order based ranking is 
employed, applications that are at the bottom of the ranking 
stack get heavily deprioritized and unfairly slowed down. This 
greatly degrades system fairness. 

Our goal, in this work, is to design a new memory sched¬ 
uler that does not suffer from these two problems; one that 
achieves high system performance and fairness while incurring 
low hardware cost and low scheduling latency. To this end, 
we propose the Blacklisting memory scheduler (BLISS). Our 
BLISS design is based on two new observations. 

Observation 1. In contrast to forming a total rank order 
of all applications (as done in prior works), we find that, to 
mitigate interference, it is sufficient to i) separate applications 
into only two groups, one group containing applications that 
are vulnerable to interference and another containing appli¬ 
cations that cause interference, and ii) prioritize the requests 
of the vulnerable-to-interference group over the requests of 
the interference-causing group. Although one prior work, 
TCM ifTSll . proposed to group applications based on memory 
intensity, TCM ranks applications individually within each 
group and enforces the total rank order during scheduling. 
Our approach overcomes the two major problems with such 
schedulers that employ per-application ranking. First, separat- 
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ing applications into only two groups, as opposed to employing 
ranking based on a total order of applications, signihcantly 
reduces hardware complexity (Section [7.2| l. Second, since our 
approach prioritizes only one dynamically-determined group 
of applications over another dynamically-determined group, no 
single application is heavily deprioritized, improving overall 
system fairness (Section |7]i. 

Observation 2. We observe that applications can be 
efficiently classihed as either vulnerable-to-interference or 
interference-causing by simply counting the number of con¬ 
secutive requests served from an application in a short time 
interval. Applications with a large number of consecutively- 
served requests are classihed as interference-causing. The 
rationale behind this approach is that when a large number 
of consecutive requests are served from the same application, 
requests of other applications are more likely to be delayed, 
causing those applications to stall. On the other hand, applica¬ 
tions with very few consecutive requests will likely not delay 
other applications and are in fact vulnerable to interference 
from other applications that have a large number of requests 
generated and served. Our approach to classifying applications 
is simpler to implement than prior approaches (e.g., Eiiini 
fTSl ) that use more complicated metrics such as memory inten¬ 
sity, row-buffer locality, bank-level parallelism or long-term 
memory service as proxies for vulnerability to interference 
(Section |7.2| i. 

Mechanism Overview. Based on these two observations, 
our mechanism, the Blacklisting Memory Scheduler (BLISS), 
counts the number of consecutive requests served from the 
same application within a short time interval. When this count 
exceeds a threshold, BLISS places the application in the 
interference-causing group, which we also call the blacklisted 
group. In other words, BLISS blacklists the application such 
that it is deprioritized. During scheduling, non-blacklisted 
(vulnerable-to-interference) applications’ requests are given 
higher priority over requests of blacklisted (interference- 
causing) applications. No per-application ranking is employed. 
Prioritization is based solely on two groups as opposed to a 
total order of applications. 

This paper makes the following contributions; 

• We present two new observations on how a simple grouping 
scheme that avoids per-application ranking can mitigate 
interference, based on our analyses and studies of previous 
memory schedulers. These observations can enable simple 
and effective memory interference mitigation techniques 
including and beyond the ones we propose in this work. 

• We propose the Blacklisting memory scheduler (BLISS), 
which achieves high system performance and fairness while 
incurring low hardware cost and complexity. The key idea 
is to separate applications into only two groups, vulnerable- 
to-interference and interference-causing, and deprioritize 
the latter during scheduling, rather than ranking individual 
applications with a total order based on their access char¬ 
acteristics (like prior work did). 

• We provide a comprehensive complexity analysis of hve 
previously proposed memory schedulers, comparing their 
critical path latency and area via RTL implementations 
(Section 0. Our results show that BLISS reduces critical 
path latency/area of the memory scheduler by 79%/43% 


respectively, compared to the best-performing ranking-based 
scheduler, TCM ffSl . 

• We evaluate BLISS against hve previously-proposed mem¬ 
ory schedulers in terms of system performance and fairness 
across a wide range of workloads (Section [7.2| i. Our results 
show that BLISS achieves 5% better system performance 
and 25% better fairness than the best-performing previous 
scheduler, TCM JTSl . 

• We evaluate the trade-off space between performance, fair¬ 
ness and hardware complexity for hve previously-proposed 
memory schedulers and BLISS (Section |7.3| l. We demon¬ 
strate that BLISS achieves the best trade-off between per¬ 
formance, fairness and complexity, compared to previous 
memory schedulers. 

2 Background and Motivation 

In this section, we hrst provide a brief background on 
the organization of a DRAM main memory system. We then 
describe previous memory scheduling proposals and their 
shortcomings that motivate the need for a new memory sched¬ 
uler - our Blacklisting memory scheduler. 

2.1 DRAM Background 

The DRAM main memory system is organized hierarchi¬ 
cally as channels, ranks and banks. Channels are independent 
and can operate in parallel. Each channel consists of ranks 
(typically 1-4) that share the command, address and data 
buses of the channel. A rank consists of multiple banks that 
can operate in parallel. However, all banks within a channel 
share the command, address and data buses of the channel. 
Each bank is organized as a two-dimensional array of rows 
and columns. On a data access, the entire row containing 
the data is brought into an internal structure called the row 
buffer. Therefore, a subsequent access to the same row can be 
served from the row buffer itself and need not access the array. 
Such an access is called a row hit. On an access to a different 
row, however, the array itself needs to be accessed. Such an 
access is called a row miss/conflict. A row hit is served ~2-3x 
faster than a row miss/conflict ini. Eor more detail on DRAM 
operation, we refer the reader to lfT9l 12111^ IJTII . 

2.2 Memory Scheduling 

Commonly employed memory controllers employ a memory 
scheduling policy called Eirst Ready First Come First Served 
(FRFCFS) ll50llT5l that leverages the row buffer by prioritizing 
row hits over row misses/conflicts. Older requests are then 
prioritized over newer requests. FRFCFS aims to maximize 
DRAM throughput by prioritizing row hits. However, it un¬ 
fairly prioritizes requests of applications that generate a large 
number of requests to the same row (high-row-buffer-locality) 
and access memory frequently (high-memory-intensity) m 
1^ . Previous work (e.g., ||3^ |3T] [iTl [HI) proposed 
application-aware memory scheduling techniques that take into 
account the memory access characteristics of applications and 
schedule requests appropriately in order to mitigate inter¬ 
application interference and improve system performance and 
fairness. We will focus on four state-of-the-art schedulers, 
which we evaluate quantitatively in Section]^ 

Mutlu and Moscibroda propose PARBS Oil , an application- 
aware memory scheduler that batches the oldest requests from 
applications and prioritizes the batched requests, with the 
goals of preventing starvation and improving fairness. Within 
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each batch, PARES ranks individual applications based on the 
number of outstanding requests of each application and, using 
this total rank order, prioritizes requests of applications that 
have low-memory-intensity to improve system throughput. 

Kim et al. ini observe that applications that receive low 
memory service tend to experience interference from appli¬ 
cations that receive high memory service. Based on this ob¬ 
servation, they propose ATLAS, an application-aware memory 
scheduling policy that ranks individual applications based on 
the amount of long-term memory service each receives and 
prioritizes applications that receive low memory service, with 
the goal of improving overall system throughput. 

Thread cluster memory scheduling (TCM) m ranks in¬ 
dividual applications by memory intensity such that low- 
memory-intensity applications are prioritized over high- 
memory-intensity applications (to improve system through¬ 
put). Kim et al. irTSll also observed that ranking all applications 
based on memory intensity and prioritizing low-memory- 
intensity applications could slow down the deprioritized high- 
memory-intensity applications signihcantly and unfairly. With 
the goal of mitigating this unfairness, TCM clusters applica¬ 
tions into low and high memory-intensity clusters and employs 
a different ranking scheme in each cluster. In the low-memory- 
intensity cluster, applications are ranked by memory intensity, 
whereas, in the high-memory-intensity cluster, applications’ 
ranks are shuffled to provide fairness. Both clusters employ a 
total rank order among applications at any given time. 

More recently, Ghose et al. cni propose a memory sched¬ 
uler that aims to prioritize critical memory requests that stall 
the instruction window for long lengths of time. The scheduler 
predicts the criticality of a load instruction based on how long 
it has stalled the instruction window in the past (using the 
instruction address (PC)) and prioritizes requests from load 
instructions that have large total and maximum stall times 
measured over a period of time. Although this scheduler is not 
application-aware, we compare to it as it is the most recent 
scheduler that aims to maximize performance by mitigating 
memory interference. 

2.3 Shortcomings of Previous Scheduiers 

These state-of-the-art schedulers attempt to achieve two 
main goals - high system performance and high fairness. 
However, previous schedulers have two major shortcomings. 
First, these schedulers increase hardware complexity in order 
to achieve high system performance and fairness. Specihcally, 
most of these schedulers rank individual applications with 
a total order, based on their memory access characteristics 
(e.g., IM] Ha [nl [H). Scheduling requests based on a total 
rank order incurs high hardware complexity, as we demon¬ 
strate in Section |7.2| slowing down the memory scheduler 
signihcantly (by 8x for TCM compared to FRFCFS), while 
also increasing its area (by 1.8x). Such high critical path 
delays in the scheduler directly increase the time it takes to 
schedule a request, potentially making the memory controller 
latency a bottleneck. Second, a total-order ranking is unfair to 
applications at the bottom of the ranking stack. Even shuffling 
the ranks periodically (like TCM does) does not fully mitigate 
the unfairness and slowdowns experienced by an application 
when it is at the bottom of the ranking stack, as we show in 
Section [2 


Figure [T] compares four major previous schedulers using 
a three-dimensional plot with performance, fairness and sim¬ 
plicity on three different axeslHOn the fairness axis, we plot 
the negative of maximum slowdown, and on the simplicity 
axis, we plot the negative of critical path latency. Hence, 
the ideal scheduler would have high performance, fairness 
and simplicity, as indicated by the black triangle. As can 
be seen, previous ranking-based schedulers, PARES, ATLAS 
and TCM, increase complexity signihcantly, compared to the 
currently employed FRFCFS scheduler, in order to achieve 
high performance and/or fairness. 


Fairness 

(negative of —FRFCFS 



path latency) 

Fig. 1; Performance vs. fairness vs. simplicity 

Our goal, in this work, is to design a new memory scheduler 
that does not suffer from these shortcomings: one that achieves 
high system performance and fairness while incurring low 
hardware cost and complexity. To this end, we propose the 
Blacklisting memory scheduler (BLISS) based on two new 
observations described in the next section. 

3 Key Observations 

As we described in the previous section, several major state- 
of-the-art memory schedulers rank individual applications with 
a total order, to mitigate inter-application interference. While 
such ranking is one way to mitigate interference, it has 
shortcomings, as described in Section |2.3| We seek to over¬ 
come these shortcomings by exploring an alternative means to 
protecting vulnerable applications from interference. We make 
two key observations on which we build our new memory 
scheduling mechanism. 

Observation 1. Separating applications into only two 
groups (interference-causing and vulnerable-to-interference), 
without ranking individual applications using a total order, is 
sufficient to mitigate inter-application interference. This leads 
to higher performance, fairness and lower complexity, all at 
the same time. 

We observe that applications that are vulnerable to interfer¬ 
ence can be protected from interference-causing applications 
by simply separating them into two groups, one contain¬ 
ing interference-causing applications and another containing 
vulnerable-to-interference applications, rather than ranking in¬ 
dividual applications with a total order as many state-of-the-art 
schedulers do. To motivate this, we contrast TCM ifTSl . which 
clusters applications into two groups and employs a total rank 
order within each cluster, with a simple scheduling mechanism 
(Grouping) that simply groups applications only into two 
groups, based on memory intensity (as TCM does), and 

1. Results across 80 simulated workloads on a 24-core, 4-channel system. 
Section [^describes our methodology and metrics. 
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Fig. 2: Request service distribution over time with TCM and Grouping schedulers 
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prioritizes the low-intensity group without employing ranking 
in each group. Grouping uses the FRFCFS policy within 
each group. Figure shows the number of requests served 
during a 100,000 cycle period at intervals of 1,000 cycles, for 
three representative applications, astar, hmmer and Ibm from 
the SPEC CPU2006 benchmark suite Il40l . using these two 
schedulers]^ These three applications are executed with other 
applications in a simulated 24-core 4-channel system]^ 

Figure shows that TCM has high variance in the number 
of requests served across time, with very few requests be¬ 
ing served during several intervals and many requests being 
served during a few intervals. This behavior is seen in most 
applications in the high-memory-intensity cluster since TCM 
ranks individual applications with a total order. This ranking 
causes some high-memory-intensity applications’ requests to 
be prioritized over other high-memory-intensity applications’ 
requests, at any point in time, resulting in high interference. 
Although TCM periodically shuffles this total-order ranking, 
we observe that an application benefits from ranking only 
during those periods when it is ranked very high. These very 
highly ranked periods correspond to the spikes in the number 
of requests served (for TCM) in Figure for that application. 
During the other periods of time when an application is 
ranked lower (i.e., most of the shuffling intervals), only a 
small number of its requests are served, resulting in very slow 
progress. Therefore, most high-memory-intensity applications 
experience high slowdowns due to the total-order ranking 
employed by TCM. 

On the other hand, when applications are separated into only 
two groups based on memory intensity and no per-application 
ranking is employed within a group, some interference exists 
among applications within each group (due to the application- 
unaware FRFCFS scheduling in each group). In the high- 
memory-intensity group, this interference contributes to the 
few low-request-service periods seen for Grouping in Figure]^ 
However, the request service behavior of Grouping is less 
spiky than that of TCM, resulting in lower memory stall 
times and a more steady and overall higher progress rate 
for high-memory-intensity applications, as compared to when 
applications are ranked in a total order. In the low-memory- 
intensity group, there is not much of a difference between 
TCM and Grouping, since applications anyway have low mem¬ 
ory intensities and hence, do not cause significant interference 
to each other. Therefore, Grouping results in higher system 
performance and significantly higher fairness than TCM, as 
shown in Figure[^(across 80 24-core workloads on a simulated 
4-channel system). 

2. All these three applications are in the high-memory-intensity group. We 
found very similar behavior in all other such applications we examined. 

3. See Section 6 for our methodology. 



Fig. 3: Performance and fairness of Grouping vs. TCM 


Solely grouping applications into two also requires much 
lower hardware overhead than ranking-based schedulers that 
incur high overhead for computing and enforcing a total rank 
order for all applications. Therefore, grouping can not only 
achieve better system performance and fairness than ranking, 
but it also can do so while incurring lower hardware cost. 
However, classifying applications into two groups at coarse 
time granularities, on the order of a few million cycles, 
like TCM’s clustering mechanism does (and like what we 
have evaluated in Figure 0, can still cause unfair application 
slowdowns. This is because applications in one group would 
be deprioritized for a long time interval, which is especially 
dangerous if application behavior changes during the interval. 
Our second observation, which we describe next, minimizes 
such unfairness and at the same time reduces the complexity 
of grouping even further. 

Observation 2. Applications can be classified into 
interference-causing and vulnerable-to-interference groups by 
monitoring the number of consecutive requests served from 
each application at the memory controller. This leads to higher 
fairness and lower complexity, at the same time, than grouping 
schemes that rely on coarse-grained memory intensity mea¬ 
surement. 

Previous work actually attempted to perform grouping, 
along with ranking, to mitigate interference. Specifically, 
TCM m ranks applications by memory intensity and clas¬ 
sifies applications that make up a certain fraction of the 
total memory bandwidth usage into a group called the low- 
memory-intensity cluster and the remaining applications into a 
second group called the high-memory-intensity cluster. While 
employing such a grouping scheme, without ranking individ¬ 
ual applications, reduces hardware complexity and unfairness 
compared to a total order based ranking scheme (as we 
show in Figure |^, it i) can still cause unfair slowdowns 
due to classifying applications into groups at coarse time 
granularities, which is especially dangerous if application 
behavior changes during an interval, and ii) incurs additional 
hardware overhead and scheduling latency to compute and 
rank applications by long-term memory intensity and total 
memory bandwidth usage. 

We propose to perform application grouping using a signif¬ 
icantly simpler, novel scheme: simply by counting the number 
of requests served from each application in a short time 
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interval. Applications that have a large number (i.e., above a 
threshold value) of consecutive requests served are classihed as 
interference-causing (this classihcation is periodically reset). 
The rationale behind this scheme is that when an application 
has a large number of consecutive requests served within a 
short time period, which is typical of applications with high 
memory intensity or high row-buffer locality, it delays other 
applications’ requests, thereby stalling their progress. Hence, 
identifying and essentially blacklisting such interference- 
causing applications by placing them in a separate group and 
deprioritizing requests of this blacklisted group can prevent 
such applications from hogging the memory bandwidth. As a 
result, the interference experienced by vulnerable applications 
is mitigated. The blacklisting classihcation is cleared periodi¬ 
cally, at short time intervals (on the order of 1000s of cycles) 
in order not to deprioritize an application for too long of a 
time period to cause unfairness or starvation. Such clearing 
and re-evaluation of application classihcation at short time 
intervals signihcantly reduces unfair application slowdowns 
(as we quantitatively show in Section |7.7| l, while reducing 
complexity compared to tracking per-application metrics such 
as memory intensity. 

Summary of Key Observations. In summary, we make two 
key novel observations that lead to our design in Section 
First, separating applications into only two groups can lead to a 
less complex and more fair and higher performance scheduler. 
Second, the two application groups can be formed seamlessly 
by monitoring the number of consecutive requests served from 
an application and deprioritizing the ones that have too many 
requests served in a short time interval. 

4 Mechanism 

In this section, we present the details of our Blacklisting 
memory scheduler (BLISS) that employs a simple grouping 
scheme motivated by our key observations from Section 
The basic idea behind BLISS is to observe the number of 
consecutive requests served from an application over a short 
time interval and blacklist applications that have a relatively 
large number of consecutive requests served. The black¬ 
listed (interference-causing) and non-blacklisted (vulnerable- 
to-interference) applications are thus separated into two dif¬ 
ferent groups. The memory scheduler then prioritizes the non- 
blacklisted group over the blacklisted group. The two main 
components of BLISS are i) the blacklisting mechanism and 
ii) the memory scheduling mechanism that schedules requests 
based on the blacklisting mechanism. We describe each in turn. 

4.1 The Blacklisting Mechanism 

The blacklisting mechanism needs to keep track of three 
quantities: 1) the application (i.e., hardware context) ID of 
the last scheduled request {Application /dQ 2) the number of 
requests served from an application {#Requests Served), and 
3) the blacklist status of each application. 

When the memory controller is about to issue a request, it 
compares the application ID of the request with the Applica¬ 
tion ID of the last scheduled request. 

4. An application here denotes a hardware context. There can be as 
many applications executing actively as there are hardware contexts. Multiple 
hai'dware contexts belonging to the same application are considered separate 
applications by our mechanism, but our mechanism can be extended to deal 
with such multithreaded applications. 


• If the application IDs of the two requests are the same, the 
#Requests Served counter is incremented. 

• If the application IDs of the two requests are not the 
same, the #Requests Served counter is reset to zero and 
the Application ID register is updated with the application 
ID of the request that is being issued. 

If the #Requests Served exceeds a Blacklisting Threshold 
(set to 4 in most of our evaluations): 

• The application with ID Application ID is blacklisted (clas¬ 
sihed as interference-causing). 

• The #Requests Served counter is reset to zero. 

The blacklist information is cleared periodically after every 
Clearing Interval (set to 10000 cycles in our major evalua¬ 
tions). 

4.2 Blacklist-Based Memory Scheduling 

Once the blacklist information is computed, it is used 
to determine the scheduling priority of a request. Memory 
requests are prioritized in the following order: 

1) Non-blacklisted applications’ requests 

2) Row-buffer hit requests 

3) Older requests 

Prioritizing requests of non-blacklisted applications over re¬ 
quests of blacklisted applications mitigates interference. Row- 
buffer hits are then prioritized to optimize DRAM bandwidth 
utilization. Finally, older requests are prioritized over younger 
requests for forward progress. 

5 Implementation 


The Blacklisting memory scheduler requires additional stor¬ 
age (hip Hops) and logic over an FRFCFS scheduler to 1) 
perform blacklisting and 2) prioritize non-blacklisted applica¬ 
tions’ requests. We analyze the storage and logic cost of it. 

5.1 Storage Cost 


In order to perform blacklisting, the memory scheduler 
needs the following storage components: 

• one register to store Application ID 

• one counter for ^Requests Served 

• one register to store the Blacklisting Threshold that deter¬ 
mines when an application should be blacklisted 

• a blacklist bit vector to indicate the blacklist status of each 
application (one bit for each hardware context) 

In order to prioritize non-blacklisted applications’ requests, 
the memory controller needs to store the application ID 
(hardware context ID) of each request so it can determine the 
blacklist status of the application and appropriately schedule 
the request. 


5.2 Logic Cost 


The memory scheduler requires comparison logic to 

• determine when an application’s #Requests Served exceeds 
the Blacklisting Threshold and set the bit corresponding to 
the application in the Blacklist bit vector. 

• prioritize non-blacklisted applications’ requests. 

We provide a detailed quantitative evaluation of the hard¬ 
ware area cost and logic latency of implementing BLISS and 


previously proposed memory schedulers, in Section 7.2 
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6 Methodology 

6.1 System Configuration 

We model the DRAM memory system using a cycle- 
level in-house DDR3-SDRAM simulator. The simulator was 
validated against Micron’s behavioral Verilog model ll25l and 
DRAMSim2 1^ . This DDR3 simulator is integrated with a 
cycle-level in-house simulator that models out-of-order exe¬ 
cution cores, driven by a Pin ll23l tool at the frontend. Each 
core has a private cache of 512 KB size. We present most of 
our results on a system with the DRAM main memory as the 
only shared resource in order to isolate the effects of memory 
bandwidth interference on application performance. We also 
present results with shared caches in Section |7.1 1 Table [T] 
provides more details of our simulated system. We perform 
most of our studies on a system with 24 cores and 4 channels. 
We provide a sensitivity analysis for a wide range of core and 
channel counts, in Section |7.11| Each channel has one rank 
and each rank has eight banks. We stripe data across channels 
and banks at the granularity of a row. 


Processor 

16-64 cores, 5.3GHz, 3-wide issue, 

8 MSHRs, 128-entry instruction window 

Last-level cache 

64B cache-line, 16-way associative, 

512KB private cache-slice per core 

Memory controller 

128-entry read/write request queue per controller 

Memory 

Timing: DDR3-1066 (8-8-8) (2^ 

Organization: 1-8 channels, 1 rank-per-channel, 

8 banks-per-rank, 8 KB row-buffer 


TABLE 1: Conhguration of the simulated system 

6.2 Workloads 

We perform our main studies using 24-core multipro- 
grammed workloads made of applications from the SPEC 
CPU2006 suite gOl, TPC-C, Matlab and the NAS parallel 
benchmark suite mQ We classify a benchmark as memory¬ 
intensive if it has a Misses Per Kilo Instruction (MPKI) greater 
than 5 and memory-non-intensive otherwise. We construct 
four categories of workloads (with 20 workloads in each 
category), with 25, 50, 75 and 100 percent of memory¬ 
intensive applications. This makes up a total of 80 workloads 
with a range of memory intensities, constructed using random 
combinations of benchmarks, modeling a cloud computing like 
scenario where workloads of various types are consolidated on 
the same node to improve efficiency. We also evaluate 16-, 32- 
and 64- core workloads, with different memory intensities, 
created using a similar methodology as described above for 
the 24-core workloads. We simulate each workload for 100 
million representative cycles, as done by previous studies in 
memory scheduling lISTl fTTl fTSl . 

6.3 Metrics 

We quantitatively compare BLISS with previous memory 
schedulers in terms of system performance, fairness and com¬ 
plexity. We use the weighted speedup 0 il ED metric to 
measure system performance. We use the maximum slowdown 
metric 0 [ni [n H to measure unfairness. We report 
the harmonic speedup metric m as another measure of 
system performance. The harmonic speedup metric also serves 
as a measure of balance between system performance and 
fairness ED. We report area in micrometer'^ {um?) and 
scheduler critical path latency in nanoseconds (ns) as measures 
of complexity. 


5. Each benchmai'k is single threaded. 


6.4 RTL Synthesis Methodology 

In order to obtain timing/area results for BLISS and previ¬ 
ous schedulers, we implement them in Register Transfer Level 
(RTL), using Verilog. We synthesize the RTL implementations 
with a commercial 32 nm standard cell library, using the 
Design Compiler tool from Synopsys. 

6.5 Mechanism Parameters 


Eor BLISS, we use a value of four for Blacklisting Thresh¬ 
old, and a value of 10000 cycles for Clearing Interval. 
These values provide a good balance between performance 
and fairness, as we observe from our sensitivity studies in 
Section 7.12 Eor the other schedulers, we tuned their param¬ 
eters to achieve high performance and fairness on our system 
conhgurations and workloads. We use a Marking-Cap of 5 for 
PARBS, cap of 4 for ERECES-Cap, HistoryWeight of 0.875 
for ATLAS, ClusterThresh of 0.2 and Shuffleinterval of 1000 
cycles for TCM. 

7 Evaluation 

We compare BLISS with hve previously proposed memory 
schedulers, ERECES, ERECES with a cap (ERECES-Cap) BOl . 
PARBS, ATLAS and TCM. ERECES-Cap is a modihed ver¬ 
sion of ERECES that caps the number of consecutive row- 
buffer hitting requests that can be served from an applica¬ 
tion EO). Eigure shows the average system performance 
(weighted speedup and harmonic speedup) and unfairness 
(maximum slowdown) across all our workloads. Eigure 
shows a Pareto plot of weighted speedup and maximum 
slowdown. We make three major observations. Eirst, BLISS 
achieves 5% better weighted speedup, 25% lower maximum 
slowdown and 19% better harmonic speedup than the best 
performing previous scheduler (in terms of weighted speedup), 
TCM, while reducing the critical path and area by 79% and 
43% respectively (as we will show in Section [7^. Therefore, 
we conclude that BLISS achieves both high system perfor¬ 
mance and fairness, at low hardware cost and complexity. 

Second, BLISS signihcantly outperforms all these hve pre¬ 
vious schedulers in terms of system performance, however, it 
has 10% higher unfairness than PARBS, the previous scheduler 
with the least unfairness. PARBS creates request batches 
containing the oldest requests from each application. Older 
batches are prioritized over newer batches. However, within 
each batch, individual applications’ requests are ranked and 
prioritized based on memory intensity. PARBS aims to pre¬ 
serve fairness by batching older requests, while still employing 
ranking within a batch to prioritize low-memory-intensity 
applications. We observe that the batching aspect of PARBS is 
quite effective in mitigating unfairness, although it increases 
complexity. This unfairness reduction also contributes to the 
high harmonic speedup of PARBS. However, batching restricts 
the amount of request reordering that can be achieved through 
ranking. Hence, low-memory-intensity applications that would 
beneht from prioritization via aggressive request reordering 
have lower performance. As a result, PARBS has 8% lower 
weighted speedup than BLISS. Eurthermore, PARBS has a 
6.5x longer critical path and ~2x greater area than BLISS, 
as we will show in Section |7.2| Therefore, we conclude that 
BLISS achieves better system performance than PARBS, at 
much lower hardware cost, while slightly trading off fairness. 

Third, BLISS has 4% higher unfairness than ERECES- 
Cap, but it also 8% higher performance than ERECES-Cap. 
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Fig. 4; System performance and fairness of BLISS compared to previous schedulers 
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Fig. 5: Pareto plot of system performance and fairness 


FRFCFS-Cap has higher fairness than BLISS since it re¬ 
stricts the length of only the ongoing row hit streak, whereas 
blacklisting an application can deprioritize the application/or 
a longer time, until the next clearing interval. As a result, 
FRFCFS-Cap slows down high-row-buffer-locality applica¬ 
tions to a lower degree than BLISS. However, restricting 
only the on-going streak rather than blacklisting an interfering 
application for a longer time causes more interference to 
other applications, degrading system performance compared 
to BLISS. Furthermore, FRFCFS-Cap is unable to mitigate 
interference due to applications with high memory inten¬ 
sity yet low-row-buffer-locality, whereas BLISS is effective 
in mitigating interference due to such applications as well. 
Hence, we conclude that BLISS achieves higher performance 
(weighted speedup) than FRFCFS-Cap, while slightly trading 
off fairness. 

7.1 Analysis of Individual Workloads 

In this section, we analyze the performance and fairness for 
individual workloads, when employing different schedulers. 
Figure shows the performance and fairness normalized to 
the baseline FRFCFS scheduler for all our 80 workloads, for 
BLISS and previous schedulers, in the form of S-curves 1^ . 
The workloads are sorted based on the performance improve¬ 
ment of BLISS. We draw three major observations. First, 
BLISS achieves the best performance among all previous 
schedulers for most of our workloads. For a few workloads, 
ATLAS achieves higher performance, by virtue of always 
prioritizing applications that receive low memory service. 
However, always prioritizing applications that receive low 
memory service can unfairly slow down applications with high 
memory intensities, thereby degrading fairness significantly 
(as shown in the maximum slowdown plot. Figure [^bottom). 
Second, BLISS achieves significantly higher fairness than AT¬ 
LAS and TCM, the best-performing previous schedulers, while 
also achieving higher performance than them and approaches 
the fairness of the fairest previous schedulers, PARBS and 
FRFCFS-Cap. As described in the analysis of average per¬ 
formance and fairness results above, PARBS, by virtue of 
request batching and FRFCFS-Cap, by virtue of restricting 
only the current row hit streak achieve higher fairness (lower 


maximum slowdown) than BLISS for a number of workloads. 
However, these schedulers achieve higher fairness at the cost 
of lower system performance, as shown in Figure Third, 
for some workloads with very high memory intensities, the 
default FRFCFS scheduler achieves the best fairness. This is 
because memory bandwidth becomes a very scarce resource 
when the memory intensity of a workload is very high. Hence, 
prioritizing row hits utilizes memory bandwidth efficiently for 
such workloads, thereby resulting in higher fairness. Based 
on these observations, we conclude that BLISS achieves the 
best performance and a good trade-off between fairness and 
performance for most of the workloads we examine. 

7.2 Hardware Complexity 

Figures |7] and show the critical path latency and area 
of five previous schedulers and BLISS for a 24-core system 
for every memory channel. We draw two major conclu¬ 
sions. First, previously proposed ranking-based schedulers, 
PARBS/ATLAS/TCM, greatly increase the critical path la¬ 
tency and area of the memory scheduler: by 1 lx/5.3x/8.1x 
and 2.4x/L7x/L8x respectively, compared to FRFCFS and 
FRFCFS-Cap, whereas BLISS increases latency and area by 
only L7x and 3.2% over FRFCFS/FRFCFS-Cap|^ Second, 
PARBS, ATLAS and TCM cannot meet the stringent worst- 
case timing requirements posed by the DDR3 and DDR4 
standards MM- In the case where every request is a row- 
buffer hit, the memory controller would have to schedule a 
request every read-to-read cycle time (tcco), the minimum 
value of which is 4 cycles for both DDR3 and DDR4. TCM 
and ATLAS can meet this worst-case timing only until DDR3- 
800 (read-to-read cycle time of 10 ns) and DDR3-1333 (read- 
to-read cycle time of 6 ns) respectively, whereas BLISS can 
meet the worst-case timing all the way down to the highest 
released frequency for DDR4, DDR4-3200 (read-to-read time 
of 2.5 ns). Hence, the high critical path latency of PARBS, 
ATLAS and TCM is a serious impediment to their adoption in 
today’s and future memory interfaces. Techniques like pipelin¬ 
ing could potentially be employed to reduce the critical path 
latency of these previous schedulers. However, the additional 
flops required for pipelining would increase area, power and 
design effort significantly. Therefore, we conclude that BLISS, 
with its greatly lower complexity and cost as well as higher 
system performance and competitive or better fairness, is a 
more effective alternative to state-of-the-art application-aware 
memory schedulers. 

7.3 Analysis of Trade-offs Between Performance, 
Fairness and Complexity 

In the previous sections, we studied the performance, fair¬ 
ness and complexity of different schedulers individually. In 

6. The area numbers are for the lowest value of critical path latency that 
the scheduler is able to meet. 
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Fig. 6: System performance 
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Fig. 7: Critical path: BLISS vs. previous schedulers 
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Fig. 8: Area: BLISS vs. previous schedulers 


this section, we will analyze the trade-offs between these 
metrics for different schedulers. Figure shows a three- 
dimensional radar plot with performance, fairness and sim¬ 
plicity on three different axes. On the fairness axis, we plot 
the negative of the maximum slowdown numbers, and on the 
simplicity axis, we plot the negative of the critical path latency 
numbers. Hence, the ideal scheduler would have high perfor¬ 
mance, fairness and simplicity, as indicated by the encompass¬ 
ing, dashed black triangle. We draw three major conclusions 
about the different schedulers we study. First, application- 
unaware schedulers, such as FRFCFS and FRFCFS-Cap, are 
simple. However, they have low performance and/or fairness. 
This is because, as described in our performance analysis 
above, FRFCFS allows long streaks of row hits from one ap¬ 
plication to cause interference to other applications. FRFCFS- 
Cap attempts to tackle this problem by restricting the length of 
current row hit streak. While such a scheme improves fairness, 
it still does not improve performance signihcantly. Second, 
application-aware schedulers, such as PARBS, ATLAS and 
TCM, improve performance or fairness by ranking based on 
applications’ memory access characteristics. However, they do 
so at the cost of increasing complexity (reducing simplicity) 
signihcantly, since they employ a full ordered ranking across 
all applications. Third, BLISS, achieves high performance and 
fairness, while keeping the design simple, thereby approaching 
the ideal scheduler design (i.e., leading to a triangle that 
is closer to the ideal triangle). This is because BLISS re¬ 
quires only simple hardware changes to the memory controller 
to blacklist applications that have long streaks of requests 
served, which effectively mitigates interference. Therefore, 


FRFCFS. ATLAS 

FRFCFS-Cap . TCM 


PARBS - BLISS - 



and fairness for all workloads 

we conclude that BLISS achieves the best trade-off between 
performance, fairness and simplicity. 


Fairness 

(negative of —FRFCFS 



path latency) 

Fig. 9: Performance, fairness and simplicity trade-offs 


7.4 Understanding the Benefits of BLISS 


We present the distribution of the number of consecutive 
requests served (streaks) from individual applications to bet¬ 
ter understand why BLISS effectively mitigates interference. 


Figure 10 shows the distribution of requests served across 
different streak lengths ranging from 1 to 16 for FRFCFS, 
PARBS, TCM and BLISS for six representative applications 
from the same 24-core workload^ The hgure captions in¬ 
dicate the memory intensity, in misses per kilo instruction 
(MPKI) and row-buffer hit rate (RBH) of each application 
when it is run alone. Figures |10a[ |10b| and |10c| show the 
streak length distributions of applications that have a tendency 
to cause interference {libquantum, mcf and Ibm). All these 
applications have high memory intensity and/or high row- 
buffer locality. Figures lOd lOe and lOf show applications 


that are vulnerable to interference (calculix, cactusADM and 
sphinxS). These applications have lower memory intensities 
and row-buffer localities, compared to the interference-causing 
applications. We observe that BLISS shifts the distribution 
of streak lengths towards the left for the interference-causing 
applications, while it shifts the streak length distribution to 
the right for the interference-prone applications. Hence, BLISS 
breaks long streaks of consecutive requests for interference- 
causing applications, while enabling longer streaks for vulner¬ 
able applications. This enables such vulnerable applications 
to make faster progress, thereby resulting in better system 
performance and fairness. We have observed similar results 
for most of our workloads. 


7. A value of 16 captures streak lengths 16 and above. 
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(d) calculix (MPKI: 0.1; RBH: 85%) 


(e) sphinxS (MPKI: 24; RBH: 91%) (f) cactusADM (MPKI: 7; RBH: 49%) 


Fig. 10; Distribution of streak lengths 


7.5 Average Request Latency 

In this section, we evaluate the average memory request 
latency (from when a request is generated until when it is 
served) metric and seek to understand its correlation with 


performance and fairness. Figure 11 presents the average 


memory request latency (from when the request is generated 
until when it is served) for the hve previously proposed 
memory schedulers and BLISS. Two major observations are 
in order. First, FRFCFS has the lowest average request latency 
among all the schedulers. This is expected since FRFCFS 
maximizes DRAM throughput by prioritizing row-buffer hits. 
Hence, the number of requests served is maximized overall 
(across all applications). However, maximizing throughput 
(i.e., minimizing overall average request latency) degrades the 
performance of low-memory-intensity applications, since these 
applications’ requests are often delayed behind row-buffer 
hits and older requests. This results in degradation in system 
performance and fairness, as shown in Figure 


FRFCFS 

FRFCFS-Ci 

PARE 



Fig. 11; The Average Request Latency Metric 
Second, ATLAS and TCM, memory schedulers that prior¬ 
itize requests of low-memory-intensity applications by em¬ 
ploying a full ordered ranking achieve relatively low av¬ 
erage latency. This is because these schedulers reduce the 
latency of serving requests from latency-critical, low-memory- 
intensity applications signihcantly. Furthermore, prioritizing 
low-memory-intensity applications’ requests does not increase 
the latency of high-memory-intensity applications signih¬ 
cantly. This is because high-memory-intensity applications 
already have high memory access latencies (even when run 
alone) due to queueing delays. Hence, average request latency 
does not increase much from deprioritizing requests of such 
applications. However, always prioritizing such latency-critical 
applications results in lower memory throughput for high- 
memory-intensity applications, resulting in unfair slowdowns 


(as we show in Figure |^. Third, memory schedulers that 
provide the best fairness, PARES, FRFCFS-Cap and BLISS 
have high average memory latencies. This is because these 
schedulers, while employing techniques to prevent requests of 
vulnerable applications with low memory intensity and low 
row-buffer locality from being delayed, also avoid unfairly 
delaying requests of high-memory-intensity applications. As 
a result, they do not reduce the request service latency of 
low-memory-intensity applications signihcantly, at the cost of 
denying memory throughput to high-memory-intensity appli¬ 
cations, unlike ATLAS or TCM. Based on these observations, 
we conclude that while some applications beneht from low 
memory access latencies, other applications beneht more from 
higher memory throughput than lower latency. Hence, average 
memory latency is not a suitable metric to estimate system 
performance or fairness. 

7.6 Impact of Clearing the Blacklist Asynchronously 

The Blacklisting scheduler we have presented and eval¬ 
uated so far clears the blacklisting information periodically 
(every 10000 cycles in our evaluations so far), such that all 
applications are removed from the blacklist at the end of a 
Clearing Interval. In this section, we evaluate an alternative 
design where an individual application is removed from the 
blacklist Clearing Interval cycles after it has been blacklisted 
(independent of the other applications). In order to implement 
this alternative design, each application would need an addi¬ 
tional counter to keep track of the number of remaining cycles 
until the application would be removed from the blacklist. This 
counter is set (to the Clearing Interval) when an application 
is blacklisted and is decremented every cycle until it becomes 
zero. When it becomes zero, the corresponding application is 
removed from the blacklist. We use a Clearing Interval of 
10000 cycles for this alternative design as well. 

Table shows the system performance and fairness of the 
original BLISS design (BLISS) and the alternative design in 
which individual applications are removed from the blacklist 
asynchronously (BLISS-Individual-Clearing). As can be seen, 
the performance and fairness of the two designs are similar. 
Furthermore, the hrst design (BLISS) is simpler since it does 
not need to maintain an additional counter for each application. 
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Fig. 12; Comparison with TCM’s clustering mechanism 



Fig. 13: Comparison with FRFCFS-Cap combined with blacklisting 


We conclude that the original BLISS design is more efficient, 
in terms of performance, fairness and complexity. 


Metric 

BLISS 

BLISS-Individual-Clearing 

Weighted Speedup 

9.18 

9.12 

Maximum Slowdown 

6.54 

6.60 


TABLE 2: Clearing the blacklist asynchronously 


7.7 Comparison with TCM’s Ciustering Mechanism 

Figure [T^ shows the system performance and fairness 
of BLISS, TCM and TCM’s clustering mechanism (TCM- 
Cluster). TCM-Cluster is a modihed version of TCM that 
performs clustering, but does not rank applications within 
each cluster. We draw two major conclusions. First, TCM- 
Cluster has similar system performance as BLISS, since both 
BLISS and TCM-Cluster prioritize vulnerable applications by 
separating them into a group and prioritizing that group rather 
than ranking individual applications. Second, TCM-Cluster 
has signihcantly higher unfairness compared to BLISS. This 
is because TCM-Cluster always deprioritizes high-memory- 
intensity applications, regardless of whether or not they are 
causing interference (as described in Observation 2 in Sec¬ 
tion [^. BLISS, on the other hand, observes an application at 
hne time granularities, independently at every memory channel 
and blacklists an application at a channel only when it is 
generating a number of consecutive requests (i.e., potentially 
causing interference to other applications). 

7.8 Evaluation of Row Hit Based Blacklisting 

BLISS, by virtue of restricting the number of consecutive 

requests that are served from an application, attempts to mit¬ 
igate the interference caused by both high-memory-intensity 
and high- row-buffer-locality applications. In this section, we 
attempt to isolate the benehts from restricting consecutive row- 
buffer hitting requests vs. non-row-buffer hitting requests. To 
this end, we evaluate the performance and fairness benehts of 
a mechanism that places an application in the blacklist when a 
certain number of row-buffer hitting requests (N) to the same 
row have been served for an application (we call this FRFCFS- 
Cap-Blacklisting as the scheduler essentially is FRFCFS-Cap 
with blacklisting). We use an N value of 4 in our evaluations. 

Figure compares the system performance and fairness 
of BLISS with FRFCFS-Cap-Blacklisting. We make three 
major observations. First, FRFCFS-Cap-Blacklisting has sim¬ 
ilar system performance as BLISS. On further analysis of 
individual workloads, we hnd that FRFCFS-Cap-Blacklisting 


blacklists only applications with high row-buffer locality, 
causing requests of non-blacklisted high-memory-intensity ap¬ 
plications to interfere with requests of low-memory-intensity 
applications. However, the performance impact of this inter¬ 
ference is offset by the performance improvement of high- 
memory-intensity applications that are not blacklisted. Second, 
FRFCFS-Cap-Blacklisting has higher unfairness (higher max¬ 
imum slowdown and lower harmonic speedup) than BLISS. 
This is because the high-memory-intensity applications that 
are not blacklisted are prioritized over the blacklisted high- 
row-buffer-locality applications, thereby interfering with and 
slowing down the high- row-buffer-locality applications sig¬ 
nihcantly. Third, FRFCFS-Cap-Blacklisting requires a per- 
bank counter to count and cap the number of row-buffer hits, 
whereas BLISS needs only one counter per-channel to count 
the number of consecutive requests from the same application. 
Therefore, we conclude that BLISS is more effective in miti¬ 
gating unfairness while incurring lower hardware cost, than the 
FRFCFS-Cap-Blacklisting scheduler that we build combining 
principles from FRFCFS-Cap and BLISS. 

7.9 Comparison with Criticality-Aware Scheduling 
We compare the system performance and fairness of BLISS 
with those of criticality-aware memory schedulers ifTOl . The 
basic idea behind criticality-aware memory scheduling is to 
prioritize memory requests from load instructions that have 
stalled the instruction window for long periods of time in 
the past. Ghose et al. cni evaluate prioritizing load requests 
based on both maximum stall time (Crit-MaxStall) and total 
stall time (Crit-TotalStall) caused by load instructions in the 
past. Figure shows the system performance and fairness 
of BLISS and the criticality-aware scheduling mechanisms, 
normalized to FRFCFS, across 40 workloads. Two obser¬ 
vations are in order. First, BLISS signihcantly outperforms 
criticality-aware scheduling mechanisms in terms of both sys¬ 
tem performance and fairness. This is because the criticality- 
aware scheduling mechanisms unfairly deprioritize and slow 
down low-memory-intensity applications that inherently gen¬ 
erate fewer requests, since stall times tend to be low for 
such applications. Second, criticality-aware scheduling incurs 
hardware cost to prioritize requests with higher stall times. 
Specihcally, the number of bits to represent stall times is on 
the order of 12-14, as described in Col. Hence, the logic for 
comparing stall times and prioritizing requests with higher 
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stall times would incur even higher cost than per-application 
ranking mechanisms where the number of bits to represent a 
core’s rank grows only as as log 2 NumberOfCores (e.g. 5 
bits for a 32-core system). Therefore, we conclude that BLISS 
achieves signihcantly better system performance and fairness, 
while incurring lower hardware cost. 


FRFCFS 1-1 

Crit-MaxStall 

Crit-TotalStall 

BLISS 



FRFCFS I-1 

Crit-MaxStall 

Crit-TotalStall 

BLISS 



Fig. 14: Comparison with criticality-aware scheduling 

7.10 Effect of Workload Memory Intensity and Row- 
buffer Locality 

In this section, we study the impact of workload memory 
intensity and row-buffer locality on performance and fairness 
of BLISS and hve previous schedulers. 

Workload Memory Intensity. Figure 15 shows system per¬ 
formance and fairness for workloads with different memory 
intensities, classihed into different categories based on the 
fraction of high-memory-intensity applications in a workload]^ 
We draw three major conclusions. First, BLISS outperforms 
previous memory schedulers in terms of system performance 
across all intensity categories. Second, the system performance 
benehts of BLISS increase with workload memory intensity. 
This is because as the number of high-memory-intensity ap¬ 
plications in a workload increases, ranking individual applica¬ 
tions, as done by previous schedulers, causes more unfairness 
and degrades system performance. Third, BLISS achieves 
signihcantly lower unfairness than previous memory sched¬ 
ulers, except FRFCFS-Cap and PARBS, across all intensity 
categories. Therefore, we conclude that BLISS is effective in 
mitigating interference and improving system performance and 
fairness across workloads with different compositions of high- 
and low-memory-intensity applications. 

FRFCFS I - 1 ATLAS FRFCFS I - 1 ATLAS 

FRFCFS-Cap i- 1 TCM i- 1 FRFCFS-Cap i- 1 TCM i- 1 

PARBS BLISS PARBS BLISS 
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Fig. 15: Sensitivity to workload memory intensity 
Workload Row-buffer Locality. Figure shows the system 
performance and fairness of hve previous schedulers and 
BLISS when the number of high row-buffer locality applica¬ 
tions in a workload is varied^ We draw three observations. 
First, BLISS achieves the best performance and close to 
the best fairness in most row-buffer locality categories. Sec¬ 
ond, BLISS’ performance and fairness benehts over baseline 
FRFCFS increase as the number of high-row-buffer-locality 
applications in a workload increases. As the number of high- 
row-buffer-locality applications in a workload increases, there 
is more interference to the low-row-buffer-locality applications 


8. We classify applications with MPKI less than 5 as low-memory-intensity 
and the rest as high-memory-intensity. 

9. We classify an application as having high row-buffer locality if its row- 
buffer hit rate is greater than 90%. 


that are vulnerable. Hence, there is more opportunity for 
BLISS to mitigate this interference and improve performance 
and fairness. Third, when all applications in a workload have 
high row-buffer locality (100%), the performance and fairness 
improvements of BLISS over baseline FRFCFS are a bit 
lower than the other categories. This is because, when all 
applications have high row-buffer locality, they each hog the 
row-buffer in turn and are not as susceptible to interference 
as the other categories in which there are vulnerable low-row- 
buffer-locality applications. However, the performance/fairness 
benehts of BLISS are still signihcant since BLISS is effective 
in regulating how the row-buffer is shared among different 
applications. Overall, we conclude that BLISS is effective 
in achieving high performance and fairness across workloads 
with different compositions of high- and low-row-buffer- 
locality applications. 
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Fig. 16: Sensitivity to row-buffer locality 

7.11 Sensitivity to System Parameters 

Core and channel count. Figures [TT] and [TS] show the system 
performance and fairness of FRFCFS, PARBS, TCM and 
BLISS for different core counts (when the channel count 
is 4) and different channel counts (when the core count is 
24), across 40 workloads for each core/channel count. The 
numbers over the bars indicate percentage increase or decrease 
compared to FRFCFS. We did not optimize the parameters 
of different schedulers for each conhguration as this requires 
months of simulation time. We draw three major conclusions. 
First, the absolute values of weighted speedup increases with 
increasing core/channel count, whereas the absolute values 
of maximum slowdown increase/decrease with increasing 
core/channel count respectively, as expected. Second, BLISS 
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Fig. 17: Sensitivity to number of cores 



Fig. 18: Sensitivity to number of channels 
achieves higher system performance and lower unfairness than 
all the other scheduling policies (except PARBS, in terms 
of fairness) similar to our results on the 24-core, 4-channel 
system, by virtue of its effective interference mitigation. The 
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only anomaly is that TCM has marginally higher weighted 
speedup than BLISS for the 64-core system. However, this 
increase comes at the cost of signihcant increase in unfairness. 
Third, BLISS’ system performance beneht (as indicated by the 
percentages on top of bars, over FRFCFS) increases when the 
system becomes more bandwidth constrained, i.e., high core 
counts and low channel counts. As contention increases in the 
system, BLISS has greater opportunity to mitigate itp^ 
Cache size. Figure 19 shows the system performance and 


fairness for hve previous schedulers and BLISS with different 
last level cache sizes (private to each core). 



Fig. 19: Sensitivity to cache size 
We make two observations. First, the absolute values of 
weighted speedup increase and maximum slowdown decrease, 
as the cache size becomes larger for all schedulers, as ex¬ 
pected. This is because contention for memory bandwidth re¬ 
duces with increasing cache capacity, improving performance 
and fairness. Second, across all the cache capacity points 
we evaluate, BLISS achieves signihcant performance and 
fairness benehts over the best-performing previous schedulers, 
while approaching close to the fairness of the fairest previous 
schedulers. 


Shared Caches. Figure 20 shows system performance and 
fairness with a 32 MB shared cache (instead of the 512 KB 
per core private caches used in our other experiments). BLISS 
achieves 5%/24% better performance/fairness compared to 
TCM, demonstrating that BLISS is effective in mitigating 
memory interference in the presence of large shared caches 
as well. 


FRFCFS I-f ATLAS ■ 

FRFCFS-CAP I-1 TCM c 

PARBS BLISS I 
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Fig. 20: Performance and fairness with a shared cache 


7.12 Sensitivity to Aigorithm Parameters 

Tables and show the system performance and fairness 
respectively of BLISS for different values of the Blacklisting 
Threshold and Clearing Interval. Three major conclusions are 
in order. First, a Clearing Interval of 10000 cycles provides 
a good balance between performance and fairness. If the 
blacklist is cleared too frequently (1000 cycles), interference- 
causing applications are not deprioritized for long enough, re¬ 
sulting in low system performance. In contrast, if the blacklist 
is cleared too infrequently, interference-causing applications 
are deprioritized for too long, resulting in high unfairness. 
Second, a Blacklisting Threshold of 4 provides a good balance 


10. Fairness benefits reduce at very high core counts and very low channel 
counts, since memoiy bandwidth becomes highly saturated. 


between performance and fairness. When Blacklisting Thresh¬ 
old is very small, applications are blacklisted as soon as they 
have very few requests served, resulting in poor interference 
mitigation as too many applications are blacklisted. On the 
other hand, when Blacklisting Threshold is large, low- and 
high-memory-intensity applications are not segregated effec¬ 
tively, leading to high unfairness. 


-Interval 

ThreshoM~~'~~~~— 

1000 

10000 

100000 

2 

8.76 

8.66 

7.95 

4 

8.61 

9.18 

8.60 

8 

8.42 

9.05 

9.24 


TABLE 3: Perf. sensitivity to threshold and interval 


'' Interval 

Threshold''"''''-'— 

1000 

10000 

100000 

2 

6.07 

6.24 

7.78 

4 

6.03 

6.54 

7.01 

8 

6.02 

7.39 

7.29 


TABLE 4: Unfairness sensitivity to threshold and interval 

7.13 Interleaving and Scheduling Interaction 

In this section, we study the impact of the address inter¬ 
leaving policy on the performance and fairness of different 
schedulers. Our analysis so far has assumed a row-interleaved 
policy, where data is distributed across channels, banks and 
rows at the granularity of a row. This policy optimizes for 
row-buffer locality by mapping a consecutive row of data to 
the same channel, bank, rank. In this section, we will consider 
two other interleaving policies, cache block interleaving and 
sub-row interleaving. 

Interaction with cache block interleaving. In a cache-block- 
interleaved system, data is striped across channels, banks and 
ranks at the granularity of a cache block. Such a policy 
optimizes for bank level parallelism, by distributing data at 
a small (cache block) granularity across channels, banks and 
ranks. 


Eigure 21 shows the system performance and fairness of 
ERECES with row interleaving (ERECES-Row), as a compar¬ 
ison point, hve previous schedulers, and BLISS with cache 
block interleaving. We draw three observations. Eirst, system 
performance and fairness of the baseline ERECES scheduler 
improve signihcantly with cache block interleaving, compared 
to with row interleaving. This is because cache block in¬ 
terleaving enables more requests to be served in parallel at 
the different channels and banks, by distributing data across 
channels and banks at the small granularity of a cache block. 
Hence, most applications, and particularly, applications that 
do not have very high row-buffer locality beneht from cache 
block interleaving. 


FRFCFS-Row c 
FRFCFS 5 
FRFCFS-Cap i 
PARES I 


ATLAS c 
TCM I 
BLISS I 
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Fig. 21: Scheduling and cache block interleaving 
Second, as expected, application-aware schedulers such 
as ATLAS and TCM achieve the best performance among 
previous schedulers, by means of prioritizing requests of 
applications with low memory intensities. However, PARBS 
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and FRFCFS-Cap do not improve fairness over the baseline, 
in contrast to our results with row interleaving. This is because 
cache block interleaving already attempts to provide fairness 
by increasing the parallelism in the system and enabling 
more requests from across different applications to be served 
in parallel, thereby reducing unfair applications slowdowns. 
More specihcally, requests that would be row-buffer hits to the 
same bank, with row interleaving, are now distributed across 
multiple channels and banks, with cache block interleaving. 
Hence, applications’ propensity to cause interference reduces, 
providing lower scope for request capping based schedulers 
such as FRFCFS-Cap and PARES to mitigate interference. 
Third, BLISS achieves within 1.3% of the performance of the 
best performing previous scheduler (ATLAS), while achiev¬ 
ing 6.2% better fairness than the fairest previous scheduler 
(PARES). BLISS effectively mitigates interference by regu¬ 
lating the number of consecutive requests served from high- 
memory-intensity applications that generate a large number of 
requests, thereby achieving high performance and fairness. 
Interaction with sub-row interleaving. While memory 
scheduling has been a prevalent approach to mitigate memory 
interference, previous work has also proposed other solutions, 
as we describe in Section One such previous work by 
Kaseridis et al. Cl proposes minimalist open page, an inter¬ 
leaving policy that distributes data across channels, ranks and 
banks at the granularity of a sub-row (partial row), rather than 
an entire row, exploiting both row-buffer locality and bank- 
level parallelism. We examine BLISS’ interaction with such a 
sub-row interleaving policy. 


Figure 22 shows the system performance and fairness of 
FRFCFS with row interleaving (FRFCFS-Row), FRFCFS with 
cache block interleaving (FRFCFS-Block) and hve previously 
proposed schedulers and BLISS, with sub-row interleaving 
(when data is striped across channels, ranks and banks at the 
granularity of four cache blocks). Three observations are in 
order. First, sub-row interleaving provides signihcant benehts 
over row interleaving, as can be observed for FRFCFS (and 
other scheduling policies by comparing with Figure |^. This 
is because sub-row interleaving enables applications to exploit 
both row-buffer locality and bank-level parallelism, unlike 
row interleaving that is mainly focused on exploiting row- 
buffer locality. Second, sub-row interleaving achieves similar 
performance and fairness as cache block interleaving. We 
observe that this is because cache block interleaving enables 
applications to exploit parallelism effectively, which makes up 
for the lost row-buffer locality from distributing data at the 
granularity of a cache block across all channels and banks. 
Third, BLISS achieves close to the performance (within 1.5%) 
of the best performing previous scheduler (TCM), while reduc¬ 
ing unfairness signihcantly and approaching the fairness of the 
fairest previous schedulers. One thing to note is that BLISS has 
higher unfairness than FRFCFS, when a sub-row-interleaved 
policy is employed. This is because the capping decisions from 
sub-row interleaving and BLISS could collectively restrict 
high-row-buffer locality applications to a large degree, thereby 
slowing them down and causing higher unfairness. Co-design 
of the scheduling and interleaving policies to achieve different 
goals such as performance/faimess is an important area of 
future research. We conclude that a BLISS-like scheduler, with 


its high performance and low complexity is a signihcantly 
better alternative to schedulers such as ATLAS/TCM in the 
pursuit of such scheduling-interleaving policy co-design. 


FRFCFS-Row ; - 1 PARBS 

FRFCFS-Block i-1 ATLAS i-1 

FRFCFS IBM TCM 

FRFCFS-Cap BLISS 




Fig. 22: Scheduling and sub-row interleaving 

8 Related Work 

To our knowledge, BLISS is the hrst memory scheduler 
design that attempts to optimize, at the same time, for high 
performance, fairness and low complexity, which are three 
competing yet important goals. The closest previous works to 
BLISS are other memory schedulers. We have already com¬ 
pared BLISS both qualitatively and quantitatively to previously 
proposed memory schedulers, FRFCFS ||35]|50l, PARBS 131] 
|28l, ATLAS ini, TCM E] and criticality-aware memory 
scheduling na, which have been designed to mitigate inter¬ 
ference in a multicore system. Other previous schedulers im 
l30l 1^ have been proposed earlier that PARBS, ATLAS and 
TCM have been shown to outperform 11311 fTTl fTSl . 

Parallel Application Memory Scheduling (PAMS) lH tack¬ 
les the problem of mitigating interference between different 
threads of a multithreaded application, while Staged Mem¬ 
ory Scheduling (SMS) |I2| attempts to mitigate interference 
between the CPU and GPU in CPU-GPU systems. Princi¬ 
ples from BLISS can be employed in both of these con¬ 
texts to identify and deprioritize interference-causing threads, 
thereby mitigating interference experienced by vulnerable 
threads/applications. FIRM BSl proposes request scheduling 
mechanisms to tackle the problem of heavy write traffic in 
persistent memory systems. BLISS can be combined with 
firm’s write handling mechanisms to achieve better fairness 
in persistent memory systems. Complexity effective memory 
access scheduling Ell attempts to achieve the performance 
of FRFCFS using a First Come First Served scheduler in 
GPU systems, by preventing row-buffer locality from being 
destroyed when data is transmitted over the on-chip network. 
Their proposal is complementary to ours. BLISS could be 
combined with such a scheduler design to prevent threads from 
hogging the row-buffer and banks. 

While memory scheduling is a major solution direction 
towards mitigating interference, previous works have also 
explored other approaches such as address interleaving m, 
memory bank/channel partitioning ll29] [131 l22l l46ll . source 
throttling ||7| |43] 0 |4| 1221 HU and thread scheduling li49l 

imisiiia to mitigate interference. 

Subrow Interleaving: Kaseridis et al. IITtII propose minimalist 
open page, a data mapping policy that interleaves data at the 
granularity of a sub-row across channels and banks such that 
applications with high row-buffer locality are prevented from 
hogging the row buffer, while still preserving some amount of 
row-buffer-locality. We study the interactions of BLISS with 
minimalist open page in Section 7.13 showing BLISS’ benehts 
on a sub-row interleaved memory system. 

Memory Channel/Bank Partitioning: Previous works 
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[TJl I 22 I l46l propose techniques to mitigate inter-application in¬ 
terference by partitioning channels/banks among applications 
such that the data of interfering applications are mapped to 
different channels/banks. Our approach is complementary to 
these schemes and can be used in conjunction with them to 
achieve more effective interference mitigation. 

Source Throttling: Source throttling techniques (e.g., fT] 
E] E] a m Ea [B]) propose to throttle the memory re¬ 
quest injection rates of interference-causing applications at the 
processor core itself rather than regulating an application’s 
access behavior at the memory, unlike memory scheduling, 
partitioning or interleaving. BLISS is complementary to source 
throttling and can be combined with it to achieve better 
interference mitigation. 

OS Thread Scheduling: Previous works |42] |43 pro¬ 
pose to mitigate shared resource contention by co-scheduling 
threads that interact well and interfere less at the shared 
resources. Such a solution relies on the presence of enough 
threads with such symbiotic properties, whereas our proposal 
can mitigate memory interference even if interfering threads 
are co-scheduled. Furthermore, such thread scheduling policies 
and BLISS can be combined in a synergistic manner to further 
improve system performance and fairness. Other techniques 
to map applications to cores to mitigate memory interference, 
such as J5l, can be combined with BLISS. 

9 Conclusion 

We introduce the Blacklisting memory scheduler (BLISS), 
a new and simple approach to memory scheduling in 
systems with multiple threads. We observe that the per- 
application ranking mechanisms employed by previously pro¬ 
posed application-aware memory schedulers incur high hard¬ 
ware cost, cause high unfairness, and lead to high scheduling 
latency to the point that the scheduler cannot meet the fast 
command scheduling requirements of state-of-the-art DDR 
protocols. BLISS overcomes these problems based on the 
key observation that it is sufficient to group applications into 
only two groups, rather than employing a total rank order 
among all applications. Our evaluations across a variety of 
workloads and systems demonstrate that BLISS has better 
system performance and fairness than previously proposed 
ranking-based schedulers, while incurring signihcantly lower 
hardware cost and latency in making scheduling decisions. 
We conclude that BLISS, with its low complexity, high system 
performance and high fairness, can be an efficient and effective 
memory scheduling substrate for current and future multicore 
and multithreaded systems. 
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